We decided to work on one of the most pressing time series problems of our day: predicting web traffic. We believe this kind of forecasting can help website operators handle outages far more effectively. The technique we implemented extends to diverse applications in financial markets, weather forecasting, and audio and video processing. Beyond that, understanding your website’s traffic trajectory can open up business opportunities too!
The dataset consists of 145k time series, each representing the number of daily page views of a different Wikipedia article, starting from July 1st, 2015 up until September 10th, 2017 (804 data points). The goal is to forecast the daily views between September 13th, 2017 and November 13th, 2017 (64 data points) for each article in the dataset. We approach this with traditional techniques: benchmark (moving-average style) methods and ARIMA-based models.
Let’s import the required libraries
# visualization
library('ggplot2')
# data manipulation
library('dplyr')
library('readr')
library('imputeTS')
#time series
library('fpp')
library('forecast')
library('xts')
library('zoo')
library('prophet')
A dataset like this is bound to contain many interventions, because web traffic depends on external factors: views can spike on one day and return to the previous level shortly after. So let’s start by finding a reasonably well-behaved series where we can apply our ARIMA methods, and then work on interventions.
# importing the dataset
train_1 <- read.csv("~/Desktop/UChicago/Quarters/03-Quarters/Data/TS/web-traffic-time-series-forecasting/train_1.csv", header = TRUE, row.names = 1,sep = ",",skip =0)
We found one series that looks promising; let’s have a look
# extracting a single article's series as a numeric vector
ASCII <- data.matrix(train_1[c("ASCII_zh.wikipedia.org_all-access_spider"),])
dimnames(ASCII)<-NULL
ASCII<-array(ASCII)
plot(ASCII,type='l')
The time series appears to have an increasing trend with no signs of seasonality.
Let’s create a time series object and split the data into train and test sets. We use data from 2015-07-01 to 2016-09-30 as the train set and data from 2016-10-01 to 2016-12-31 as the test set.
time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-12-31"), by = "day")
ASCII_ts <- xts(ASCII, order.by =time_index ,frequency = 365.25)
tsdisplay(ASCII_ts,ylab="ASCII daily traffic",xlab="Day")
ASCII_train<-ASCII_ts['2015-07-01/2016-09-30']
ASCII_test<-ASCII_ts['2016-10-01/2016-12-31']
tsdisplay(ASCII_train,ylab="ASCII daily traffic",xlab="Day")
tsdisplay(ASCII_test,ylab="ASCII daily traffic",xlab="Day")
length(ASCII_test)
## [1] 92
We can see from the ACF and PACF that the autocorrelations aren’t dying down quickly, which means the series isn’t stationary, and hence we will need differencing. Let’s have a look at that as well.
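As a quick illustration of why differencing helps (a simulated toy example, not our traffic data): a random walk is non-stationary and its ACF decays slowly, while its first difference recovers the white-noise increments that generated it.

```r
set.seed(42)
eps <- rnorm(200)   # white-noise increments
rw  <- cumsum(eps)  # random walk: non-stationary, slowly decaying ACF
d   <- diff(rw)     # first difference: stationary

# diff() undoes cumsum(): the differenced series is the original noise
# (minus the first element)
all.equal(d, eps[-1])
```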
Let’s check whether we need a Box-Cox transformation to decouple the mean and variance
BoxCox.lambda(ASCII_train)
## [1] 0.3897714
Yes, we need to perform a transformation with lambda = 0.3897714
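For reference, the Box-Cox transform applied by `BoxCox()` is (y^λ − 1)/λ for λ ≠ 0 (and log(y) at λ = 0), for positive y. A minimal base-R sketch on toy values:

```r
# Box-Cox transform for positive y (base-R sketch of forecast::BoxCox)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 10, 100, 1000)  # values spanning three orders of magnitude
box_cox(y, 0.39)          # compresses large values, like a softened log
box_cox(y, 0)             # lambda = 0 reduces to the natural log
```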
We can also confirm our suspicion that the data isn’t stationary
kpss.test(ASCII_train)
##
## KPSS Test for Level Stationarity
##
## data: ASCII_train
## KPSS Level = 5.6063, Truncation lag parameter = 5, p-value = 0.01
Since the p-value is less than 0.05, we reject the KPSS null hypothesis of level stationarity: the data isn’t stationary
Let’s apply one level of differencing and check
ASCII_train_boxcox<-ASCII_train %>% BoxCox(lambda = BoxCox.lambda(ASCII_train))
ASCII_train_diff <- diff(ASCII_train_boxcox)
kpss.test(ASCII_train_diff)
##
## KPSS Test for Level Stationarity
##
## data: ASCII_train_diff
## KPSS Level = NaN, Truncation lag parameter = 5, p-value = NA
tsdisplay(ASCII_train_diff)
The KPSS statistic comes back NaN because diff() leaves a leading NA in the series; running kpss.test(na.omit(ASCII_train_diff)) avoids this. Visually, the differenced series now looks stationary. Let’s start with naive forecasts before we move on to the ARIMA model
#forecast horizon
h<-92
# converting the train/test windows to plain ts objects for the benchmark methods
ASCII_train_new<-ts(ASCII_ts['2015-07-01/2016-09-30'])
ASCII_test_new<-ts(ASCII_ts['2016-10-01/2016-12-31'])
# fitting the benchmark models
Model_Mean <- meanf(ASCII_train_new, h)
Model_Naive <- naive(ASCII_train_new, h)
Model_Drift <- rwf(ASCII_train_new, h, drift=TRUE)
# plotting all three benchmark forecasts
autoplot(ASCII_train_new) +
autolayer(Model_Mean$mean, series="Mean") +
autolayer(Model_Naive$mean, series="Naive") +
autolayer(Model_Drift$mean, series="Drift") +
ggtitle("Forecasts for daily ASCII Wikipedia Page") +
xlab("Days") + ylab("ASCII traffic")
Let’s have a look at the accuracy metrics, in particular the out-of-sample (test set) rows
accuracy(Model_Mean,ASCII_test)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.288732e-15 9.247481 7.477975 -35.02098 58.12142 0.3481664
## Test set 5.076182e+00 9.681532 7.009018 11.46968 23.96929 0.3263323
## ACF1
## Training set 0.5328766
## Test set NA
accuracy(Model_Naive,ASCII_test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.04157549 8.917198 6.956236 -12.684160 40.37525 0.3238748
## Test set 1.55434783 8.389305 6.228261 -3.046881 24.13922 0.2899810
## ACF1
## Training set -0.4634768
## Test set NA
accuracy(Model_Drift,ASCII_test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.791771e-16 8.917101 6.959238 -12.94458 40.45229 0.3240146
## Test set -3.789126e-01 7.981986 6.161046 -10.65661 25.72335 0.2868516
## ACF1
## Training set -0.4634768
## Test set NA
The metrics above show that drift performs best because the trend is increasing, but it isn’t a stable model: it assumes the series will keep increasing forever. Let’s have a look at an ARIMA model.
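The drift method (forecast::rwf with drift = TRUE) is just a straight line: each point forecast is the last observation plus h times the average historical increment, i.e. the line through the first and last observations. A toy sketch with hypothetical values, not our series:

```r
y <- c(10, 12, 11, 15, 14, 18)        # toy series
n <- length(y)
slope <- (y[n] - y[1]) / (n - 1)      # average per-step increment: 1.6
h <- 1:3                              # forecast horizon
drift_fc <- y[n] + h * slope          # point forecasts: 19.6, 21.2, 22.8
drift_fc
```
This makes the instability obvious: the forecast rises without bound, regardless of how the series actually behaves near the end.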
auto.arima(ASCII_train,seasonal = TRUE,lambda = 'auto')
## Series: ASCII_train
## ARIMA(2,1,3) with drift
## Box Cox transformation: lambda= 0.3897714
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 drift
## 0.7449 -0.9047 -1.5888 1.4992 -0.8159 0.0073
## s.e. 0.0570 0.0682 0.0683 0.1202 0.0668 0.0042
##
## sigma^2 estimated as 1.153: log likelihood=-678.97
## AIC=1371.94 AICc=1372.19 BIC=1400.82
m1<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,3),include.drift = TRUE)
checkresiduals(m1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,3) with drift
## Q* = 12.022, df = 4, p-value = 0.01719
##
## Model df: 6. Total lags used: 10
The Ljung-Box p-value (0.017) is below 0.05, so the residuals still show some autocorrelation; the fit isn’t perfect, though the residuals look roughly stationary. Let’s have a look at the metrics
autoplot(forecast(m1,h=92))
accuracy(forecast(m1,h=92),ASCII_test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.5616526 6.780914 5.236161 -8.861843 29.96078 0.243790
## Test set -2.3523442 8.302841 6.700521 -18.718375 29.63149 0.311969
## ACF1
## Training set 0.001079399
## Test set NA
Looking at the metrics, this model has the lowest training RMSE, and its test RMSE is lower than the mean and naive benchmarks (though drift is still slightly better out of sample).
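Collecting the test-set RMSEs reported by accuracy() above into one table (values copied from the outputs, rounded to two decimals) makes the comparison explicit:

```r
# test-set RMSEs copied from the accuracy() outputs above
rmse <- data.frame(
  model     = c("Mean", "Naive", "Drift", "ARIMA(2,1,3)"),
  test_RMSE = c(9.68, 8.39, 7.98, 8.30)
)
rmse[order(rmse$test_RMSE), ]   # drift still wins out of sample
```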
Let’s check if there are better models using the Extended Autocorrelation Function (EACF)
source("~/Desktop/UChicago/Quarters/03-Quarters/03-31006-TimeSeries/03-Week/eacf.r")
#differencing and box cox transforming the training data
ASCII_train_new_boxcox<-ASCII_train_new %>% BoxCox(lambda = BoxCox.lambda(ASCII_train_new))
ASCII_train_new_diff <- diff(ASCII_train_new_boxcox)
eacf(ASCII_train_new_diff)
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x o o x x o o o x x o o o o
## 1 x x o x x o o o x o o o o o
## 2 x o o x x o o o x o o x o o
## 3 x x x x o o o o x o o o o o
## 4 x o x x x o o o x o x o o o
## 5 x x o o x x o o o o x o o o
## 6 x x o o x o o o o o x o o o
## 7 x o x x x o x x o o x o o x
Note: We don’t wish to overcomplicate the model, so we cap p and q at 2. From the matrix above we try (p, q) = (0, 1), (0, 2), (1, 2), (2, 1), and (2, 2), each with one level of differencing:
m2<-Arima(ASCII_train,lambda = 'auto',order=c(0,1,1),include.drift = TRUE)
m3<-Arima(ASCII_train,lambda = 'auto',order=c(0,1,2),include.drift = TRUE)
m4<-Arima(ASCII_train,lambda = 'auto',order=c(1,1,2),include.drift = TRUE)
m5<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,1),include.drift = TRUE)
m6<-Arima(ASCII_train,lambda = 'auto',order=c(2,1,2),include.drift = TRUE)
cbind(m1$aicc,m2$aicc,m3$aicc,m4$aicc,m5$aicc,m6$aicc)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1372.194 1378.982 1376.102 1377.536 1377.471 1379.025
By AICc, m1 (the ARIMA(2,1,3) found by auto.arima) is the best, so we’ll go with that model. Next, let’s also try Holt’s linear trend method, with and without damping.
m_hw <- holt(ASCII_train,h=92)
autoplot(m_hw)
m_hw_damp <- holt(ASCII_train,h=92,damped = TRUE)
autoplot(m_hw_damp)
accuracy(m_hw)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.005428262 6.854588 5.318492 -12.91236 31.87033 0.2476232
## ACF1
## Training set 0.07815376
accuracy(m_hw_damp)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.3755877 6.831301 5.282292 -11.99129 31.6822 0.2459377 0.07318879
checkresiduals(m_hw)
##
## Ljung-Box test
##
## data: Residuals from Holt's method
## Q* = 17.04, df = 6, p-value = 0.009136
##
## Model df: 4. Total lags used: 10
checkresiduals(m_hw_damp)
##
## Ljung-Box test
##
## data: Residuals from Damped Holt's method
## Q* = 17.039, df = 5, p-value = 0.004426
##
## Model df: 5. Total lags used: 10
The Ljung-Box tests reject the null hypothesis of no autocorrelation (p-values 0.009 and 0.004, both below 0.05), which means the residuals are correlated and neither Holt model is a good fit.
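For intuition, base R’s Box.test() runs the same Ljung-Box test directly. On genuinely uncorrelated residuals the p-value is usually large (fail to reject), while on a heavily autocorrelated series like a random walk it is essentially zero. A toy example, unrelated to our data:

```r
set.seed(123)
e  <- rnorm(500)   # white noise: a stand-in for well-behaved residuals
rw <- cumsum(e)    # random walk: heavily autocorrelated

Box.test(e,  lag = 10, type = "Ljung-Box")$p.value  # typically large for white noise
Box.test(rw, lag = 10, type = "Ljung-Box")$p.value  # ~0: reject no-autocorrelation
```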
Let’s evaluate the best model found, i.e. ARIMA(2,1,3), with time series cross-validation
m7 <- function(x, h){forecast(Arima(x, order=c(2,1,3),lambda = 'auto',include.drift=TRUE), h=h)}
error_1 <- tsCV(ASCII_train, m7, h=1)
error_2 <- tsCV(ASCII_train, m7, h=1, window = 12) # Rolling/Sliding Window
Let’s have a look at the errors
autoplot(error_1, series = 'Expanding Window') +
autolayer(error_2, series = 'Rolling Window')
Let’s have a look at the values as well
print(sqrt(mean(error_1^2, na.rm=TRUE)))
## [1] 7.229139
print(sqrt(mean(error_2^2, na.rm=TRUE)))
## [1] 12.39619
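To make the two schemes concrete: tsCV with no window refits on all data up to each origin t (expanding window) and scores the one-step forecast of t+1, while window = 12 keeps only the last 12 observations (rolling window). A tiny hand-rolled one-step CV with a naive forecaster on toy data, in base R only:

```r
y <- c(5, 7, 6, 9, 8, 10, 12)       # toy series
# the naive one-step forecast at each origin t is simply y[t],
# so the forecast errors are the successive differences
errors  <- y[-1] - head(y, -1)      # errors for origins t = 1..6
rmse_cv <- sqrt(mean(errors^2))     # one-step CV RMSE
rmse_cv
```
Note that for the naive forecaster the window size is irrelevant (only the last observation is used); with a fitted model like our ARIMA(2,1,3), the expanding and rolling variants diverge, which is exactly the gap between the two RMSEs above.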
Finally, let’s fit a Prophet model on the training window and forecast the 92-day test horizon. Prophet expects a data frame with columns ds (date) and y (value).
time_index <- seq(from = as.POSIXct("2015-07-01"), to = as.POSIXct("2016-09-30"), by = "day")
df<-cbind(as.data.frame(time_index),y=ASCII_ts['2015-07-01/2016-09-30'])
colnames(df)<-c('ds','y')
m <- prophet(df)
future <- make_future_dataframe(m, periods = 92)
forecast <- predict(m, future)
plot(m, forecast)
# plotting the full series for comparison
autoplot(ASCII_ts)